DOCUMENT RESUME 



ED 365 152 



FL 021 770 



AUTHOR       Brown, James Dean; And Others
TITLE        Southeast Asian Languages Proficiency Examinations.
PUB DATE     91
NOTE         21p.; In: Sarinee, Anivan, Ed. Current Developments
             in Language Testing. Anthology Series 25. Paper
             presented at the Regional Language Centre Seminar on
             Language Testing and Language Programme Evaluation
             (April 9-12, 1990); see FL 021 757.
PUB TYPE     Reports - Descriptive (141) -- Speeches/Conference
             Papers (150)
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  Cambodian; Cloze Procedure; Comparative Analysis;
             Dictation; Indonesian; Interviews; *Language
             Proficiency; *Language Tests; Listening
             Comprehension; *Second Languages; Tagalog; *Test
             Construction; Test Format; Test Reliability; Test
             Validity; Thai; *Uncommonly Taught Languages;
             Vietnamese
IDENTIFIERS  ACTFL Proficiency Guidelines; Southeast Asian Summer
             Studies Institute



ABSTRACT 

The design, administration, revision, and validation 
of the Southeast Asian Summer Studies Institute proficiency 
examinations are reported. The examinations were created as parallel 
language proficiency tests in each of five languages: Indonesian,
Khmer, Tagalog, Thai, and Vietnamese. Four tests were developed in
each language: multiple-choice listening comprehension, interview,
dictation, and cloze. The interview and listening tests were each 
designed to assess all of the levels of language ability in the 
American Council on the Teaching of Foreign Languages (ACTFL) 
proficiency guidelines. The study reported here (involving 218 
students) explored the score distributions for each test on the 
proficiency batteries for each language, as well as differences 
between distributions for the pilot and revised versions. Relative 
reliability estimates for pilot and revised versions and the 
relationships of tests across languages were also compared. Based on 
the analyses it is concluded that the tests in each of the five 
examinations are reasonably well-centered and reliable, and 
distributions are adequate. (MSE) 



***********************************************************************
* Reproductions supplied by EDRS are the best that can be made        *
* from the original document.                                         *
***********************************************************************



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY

TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."

U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

□ Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official
OERI position or policy.



SOUTHEAST ASIAN
LANGUAGES PROFICIENCY EXAMINATIONS 



James Dean Brown 

H Gary Cook
Charles Lockhart
Teresita Ramos



ABSTRACT 



This paper reports on the design, administration, revision and validation of 
the Southeast Asian Summer Studies Institute (SEASSI) Proficiency 
Examinations. The goal was to develop parallel language proficiency 
examinations in each of five languages taught in the SEASSI: Indonesian,
Khmer, Tagalog, Thai and Vietnamese. Four tests were developed for each of
these languages: multiple-choice listening, interview, dictation and cloze test.
To maximize the relationships among these examinations and the associated
curricula, the interview and listening tests were each designed to assess all of the
levels of language ability which are described in the ACTFL Proficiency
Guidelines from "novice" to "advanced-plus."

This study (N = 218) explored the score distributions for each test on the
proficiency batteries for each language, as well as differences between the
distributions for the pilot (1988) and revised (1989) versions. The relative
reliability estimates of the pilot and revised versions were also compared, as were
the various relationships among tests across languages.

The results are discussed in terms of the degree to which the scores on the 

strategies here are generalizable to test development projects for other 

Southeast Asian languages. 



Each year since 1984, a Southeast Asian Summer Studies Institute
(SEASSI) has been held on some university campus in the United States. As the
name implies, the purpose of SEASSI is to provide instruction in the "lesser
taught" languages from Southeast Asia. In 1988, SEASSI came to the University
of Hawaii at Manoa for two consecutive summers. Since we found ourselves
with several language testing specialists, a strong Indo-Pacific Languages






BEST COPY AVAILABLE 



department, and two consecutive years to work, we were in a unique position to
develop overall proficiency tests for a number of the languages taught in SEASSI
- tests that could then be passed on to future SEASSIs.

The central purpose of this paper is to describe the design, production,
administration, piloting, revision and validation of these Southeast Asian
Summer Studies Institute Proficiency Examinations (SEASSIPE). From the outset,
the goal of this project was to develop overall language proficiency examinations
in each of five languages taught in the SEASSI: Indonesian, Khmer, Tagalog,
Thai and Vietnamese. The ultimate objective of these tests was to assess the
grammatical and communicative ability of students studying these languages in
order to gauge their overall proficiency in the languages. It was decided early
that the tests should be designed to measure all of the levels of language ability
which are described in the ACTFL Proficiency Guidelines from "novice" to
"advanced-plus" for speaking and listening (see Appendix A from ACTFL 1986,
Liskin-Gasparro 1982, and/or ILR 1982). Though the ACTFL guidelines are
somewhat controversial (e.g., see Savignon 1985; Bachman and Savignon 1986),
they provided a relatively simple paradigm within which we could develop and
describe these tests in terms familiar to all of the teachers involved in the
project, as well as to any language teachers who might be required to use the
tests in the future.

The central research questions investigated in this study were as follows:

(1) How are the scores distributed for each test of the proficiency
battery for each language, and how do the distributions differ
between the pilot (1988) and revised (1989) versions?

(2) To what degree are the tests reliable? How does the reliability differ 
between the pilot and revised versions? 

(3) To what degree are the tests intercorrelated? How do these 
correlation coefficients differ between the pilot and revised versions? 

(4) To what degree are the tests parallel across languages? 

(5) To what degree are the tests valid for purposes of testing overall 
proficiency in these languages? 



(6) To what degree are the strategies described here generalizable to
test development projects for other languages?



METHOD 



A test development project like this has many facets. In order to facilitate
the description and explanation of the project, this METHOD section will be
organized into a description of the subjects used for norming the tests, a section
on the materials involved in the testing, an explanation of the procedures of the
project, and a description of the statistical procedures used to analyze, improve
and reanalyze the tests.



Subjects

A total of 218 students were involved in this project: 101 in the pilot stage
of this project and 117 in the validation stage.

The 101 students involved in the pilot stage were all students in the SEASSI
program during the summer of 1988 at the University of Hawaii at Manoa. They
were enrolled in the first year (45.5%), second year (32.7%) and third year
(21.8%) language courses in Indonesian (n = 26), Khmer (n = 21), Tagalog (n
= 14), Thai (n = 17) and Vietnamese (n = 23). There were 48 females (47.5%)
and 53 males (52.5%). The vast majority of these students were native speakers
of English (80.7%), though there were speakers of other languages who
participated (19.3%).

The 117 students involved in the validation stage of this test development
project were all students in the SEASSI program during summer 1989. They
were enrolled in the first year (48.7%), second year (41.0%) and third year
(10.3%) language courses in Indonesian (n = 54), Khmer (n = 18), Tagalog (n
= 10), Thai (n = 23) and Vietnamese (n = 12). There were 57 females (48.7%)
and 60 males (51.3%).

In general, all of the groups in this study were intact classes. To some
degree, the participation of the students depended on the cooperation of their
teachers. Since that cooperation was not universal, the samples in this project
can only be viewed as typical of volunteer groups drawn from a summer
intensive language study situation like that in SEASSI.



Materials 



There were two test batteries employed in this project. The test of focus
was the SEASSIPE. However, the Modern Language Aptitude Test (MLAT),
developed by Carroll and Sapon (1959), was also administered. Each will be
described in turn.

Description of the SEASSIPE. The SEASSIPE battery for each language
presently consists of four tests: multiple-choice listening, oral interview
procedure, dictation and cloze test. In order to make the tests as comparable as
possible across the five languages, they were all developed first in an English
prototype version. The English version was then translated into the target
language with an emphasis on truly translating the material into that language
such that the result would be natural Indonesian, Khmer, Tagalog, Thai or
Vietnamese. The multiple-choice listening test presented the students with aural
statements or questions in the target language, and they were then asked what
they would say (given four responses to choose from). The pilot versions of the
test all contained 36 items, which were developed in 1988 on the basis of the
ACTFL guidelines for listening (see APPENDIX A). The tests were then
administered in the 1988 SEASSI. During 1989, the items were revised using
distractor efficiency analysis, and six items were eliminated on the basis of
overall item statistics. Thus the revised versions of the listening test all
contained a total of 30 items.
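
The kind of item statistics that can guide such revisions may be sketched as follows. The paper does not list the exact statistics that were computed, so the two shown here (item facility and a simple upper-lower discrimination index) are assumptions about what "overall item statistics" might include, and the response data are hypothetical.

```python
def item_facility(item_scores):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def item_discrimination(item_scores, total_scores, group_size=None):
    """Upper-lower index: item facility in the top group minus the bottom group.

    Groups are formed by ranking examinees on their total test scores.
    By default each group holds one third of the examinees.
    """
    n = len(total_scores)
    k = group_size if group_size is not None else max(1, n // 3)
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    p_high = sum(item_scores[i] for i in high) / k
    p_low = sum(item_scores[i] for i in low) / k
    return p_high - p_low


# Hypothetical data: six examinees, one item, plus their total test scores.
item = [1, 0, 1, 1, 0, 1]
totals = [28, 12, 25, 30, 9, 20]

facility = item_facility(item)
discrimination = item_discrimination(item, totals)
```

Under this sketch, an item that nearly everyone answers correctly, or one that the weaker students answer correctly as often as the stronger students, would be a candidate for revision or removal.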

The oral interview procedure was designed such that the interviewer would
ask students questions at various levels of difficulty in the target language (based
on the ACTFL speaking and listening guidelines in APPENDIX A). The
students were required to respond in the target language. In the pilot version of
the test, the responses of the students were rated on a 0-108 scale. On each of
36 questions, this scale had 0 to 3 points (one each for three categories:
accuracy, fluency, and meaning). On the revised version of the interview, 12
questions were eliminated. Hence on the revised version, the students were
rated on a 0-72 scale including one point each for accuracy, fluency and meaning
based on a total of 24 interview questions.

The dictation consisted of an eighty word passage in the target language. 
The original English prototype was of approximately 7th grade reading level
(using the Fry 1976 scale). The passage was read three times (once at normal 
rate of speech, then again with pauses at the end of logical phrases, and finally, 
again at normal rate). Each word that was morphologically correct was scored 
as a right answer. Because these dictations appeared to be working reasonably 
well, only very minor changes were made between the pilot and revised versions 
of this test. 

The cloze test was based on an English prototype of 450 words at about the 
7th grade reading level (again using the Fry 1976 scale). The cloze passage was 
created in the target language by translating the English passage and deleting 
every 13th word for a total of 30 blanks. The pilot and revised versions of this 
test each had the same number of items. However, blanks that proved
ineffective statistically or linguistically in the pilot versions were changed to more
promising positions in the revised tests (see Brown 1988b for more on cloze test
improvement strategies).
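
The fixed-ratio deletion described above can be sketched in a few lines. This is an illustration only: the function name, the blank marker and the stand-in passage are hypothetical, and splitting on whitespace is an assumption that would not hold for scripts such as Thai or Khmer, which do not separate words with spaces.

```python
def make_cloze(text, nth=13, max_blanks=30, blank="______"):
    """Delete every nth word, returning the gapped text and the answer key."""
    words = text.split()
    answers = []
    for i in range(nth - 1, len(words), nth):
        if len(answers) == max_blanks:
            break
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers


# Hypothetical stand-in passage: 450 numbered tokens instead of real prose.
passage = " ".join(f"w{i}" for i in range(1, 451))
gapped, key = make_cloze(passage)
```

Moving an ineffective blank to a "more promising position" would then amount to replacing one index in the deletion pattern by hand, which this sketch does not attempt.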

As mentioned above, these four tests were developed for each of five
languages taught in the SEASSI. To the degree that it was possible, they were






[...] that, for instance, a score of 50 on the [...]. To investigate [...]
across languages [...].

All of the results of the SEASSI Proficiency [...] experimental. Hence the
results of the pilot [...] primarily to improve the tests and [...] of each test.
The scores were reported to the teachers [...]. However, the [...] in any way
[...] the students [...].

Description of the MLAT. The short [...] spelling clues [...]. The MLAT was
included to control for differences in language learning aptitude across the
five language groups and thereby [...] investigating the equivalency of the
tests across languages. The [...] aptitude test. It [...] classroom. [...] and
did not affect the students' grades in any way. The scores [...] national
percentile ranking were reported individually to the [...] learning foreign
languages [...] analyses reported below.



Procedures 

The overall plan for this project proceeded on schedule in four main stages and a
number of smaller steps.

Stage one: Design. The tests were designed during June 1988 at the [...]
the Indo-Pacific Languages department and in SEASSI). J D Brown and C
Lockhart were responsible for producing [...] prototypes [...] into each of the five
languages. J D Brown took primary responsibility for overall test design,
administration and analysis.

Stage two: Production. The actual production of the tapes, booklets, answer
sheets, scoring protocols and proctor instructions took place during the last week
of July 1988 and the tests were actually administered in SEASSI classes on
August 5, 1988. This stage was the responsibility of T Ramos with the help of C
Lockhart.

Stage three: Validation. The on-going validation process involved the
collection and organization of the August 5th data, as well as teacher ratings of
the students' proficiency on the interview. Item analysis, descriptive statistics,
correlational analysis and feedback from the teachers and students were all used
to revise the four tests with the goal of improving them in terms of central
tendency, dispersion, reliability and validity. The actual revisions and production
of new versions of the tests took place during the spring and summer of 1989.
This stage was primarily the responsibility of J D Brown with the help and
cooperation of H Gary Cook, T Ramos and the SEASSI teachers.

Stage four: Final Product. Revised versions of these tests were administered
again in the 1989 SEASSI. This was primarily the job of H G Cook. A test
manual was also produced (Brown, Cook, Lockhart and Ramos, unpublished
ms). Based on the students' SEASSI performances and MLAT scores from both
the 1988 and 1989 SEASSI, the manual provides directions for administering the
tests, as well as discussion of the test development and norming procedures. The
discussion focuses on the value of these new measures as indirect tests of
ACTFL proficiency levels. The manual was developed following the standards
set by AERA, APA and NCME in Standards for Educational and Psychological
Testing (see APA 1985). The production of all tests, answer keys, audio tapes,
answer sheets, manuals and reports was the primary responsibility of J D Brown.



Analyses 

The analyses for this study were conducted using the QuattroPro
spreadsheet program (Borland 1989), as well as the ABSTAT (Bell-Anderson
1989) and SYSTAT (Wilkinson 1988) statistical programs. These analyses fall
into four categories: descriptive statistics, reliability statistics, correlational
analyses, and analysis of covariance.

Because of the number of tests involved when we analyzed four tests each
in two versions (1988 pilot version and 1989 revised version) for each of five
languages (4 x 2 x 5 = 40), the descriptive statistics reported here are limited to
the number of items, the number of subjects, the mean and the standard
deviation [...] differences in initial language aptitude (as measured by the
MLAT). The alpha significance level for all statistical decisions was set at .05.



RESULTS

Summary descriptive statistics are presented in Table 1 for the pilot and
revised versions of the four tests for each of the five languages. The languages
are listed across the top of the table with the mean and standard deviation for
each given directly below the language headings. The mean provides an
indication of the overall central tendency, or typical behavior of a group, and the
standard deviation gives an estimate of the average distance of students from the
mean (see Brown 1988a for more on such statistics). The versions (i.e., the pilot
versions administered in summer of 1988 or the revised versions administered in
summer of 1989) and tests (Listening, Oral Interview, Dictation and Cloze Test)
are labeled down the left side of the table along with the number of items (k) in
parentheses.



[TABLE 1: summary descriptive statistics (MEAN and STD) for each test, by
language and version; values not recoverable from the source scan]






Notice that, for each test, there is considerable variation across versions and
languages not only in the magnitude of the means but also among the standard
deviations. It seems probable that the disparities across versions (1988 and
1989) are largely due to the revision processes, but they may in part be caused by
differences in the numbers of students at each level of study or by other
differences among the samples used during the two summers.

Table 2 presents the reliabilities for each test based on the scores produced
by the groups of students studying each of the languages. A reliability coefficient
estimates the degree to which a test is consistent in what it measures. Such
coefficients can range from 0.00 (wholly unreliable, or inconsistent) to 1.00
(completely reliable, or 100 percent consistent), and can take on all of the values
in between, as well.

Notice that, once again, the languages are shown across the top of the table 
with two types of reliability, alpha and K-R21, labeled just under each language
heading. You will also find that the versions (1988 or 1989) and tests are again 
labeled down the left side of the table. 

[TABLE 2: reliability estimates (alpha and K-R21) for each test, by language and
version; values not reliably recoverable from the source scan. The table notes
mark some cells as "not calculated" and others as "not applicable".]



As mentioned above, the reliability estimates reported in Table 2 are based
on Cronbach alpha and on the K-R21. Cronbach alpha is an algebraic identity
with the more familiar K-R20 for any test which is dichotomously scored (e.g., the
listening and cloze tests in this study). However, for any test which has a
[...] Cronbach 1970).
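
The two reliability estimates named above can be computed as in the following sketch. The formulas are the standard ones for Cronbach alpha and K-R21; the response matrix is hypothetical, and the use of population variances throughout is an assumption made for consistency between the two estimates. K-R21 needs only the number of items, the mean and the variance of the total scores, because it assumes that all items are equally difficult.

```python
from statistics import mean, pvariance


def cronbach_alpha(item_matrix):
    """item_matrix: one row per examinee, one column per item."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    item_vars = [pvariance([row[j] for row in item_matrix]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))


def kr21(k, total_mean, total_variance):
    """K-R21 from the number of items, the mean and the variance of totals."""
    return (k / (k - 1)) * (1 - total_mean * (k - total_mean) / (k * total_variance))


# Hypothetical 0/1 responses: five examinees by four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
alpha = cronbach_alpha(responses)
kr21_estimate = kr21(4, mean(totals), pvariance(totals))
```

With these hypothetical data the K-R21 estimate comes out lower than alpha, which is the usual pattern when items differ in difficulty.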




[TABLE 3: intercorrelations among the tests, by language and version; values not
recoverable from the source scan]

* p < .05

[...] the SEASSIPE tests on both versions were [...]. A coefficient of [...]
indicates that they are strongly related. [...] Notice that the languages are
labeled across the top with Listening (L) [...] versions (1988 or 1989) and tests
[...] remember that each [...] and Listening tests in Indonesian in the 1988
pilot version.






Following some of the correlation coefficients in Table 3, there is an
asterisk, which refers down below the table to p < .05. This simply means that
these correlation coefficients are statistically significant at the .05 level. In other
words, there is only a five percent probability that the correlation coefficients
with asterisks occurred by chance alone. Put yet another way, there is a 95
percent probability that the coefficients with asterisks occurred for other than
chance reasons. Those coefficients without asterisks can be interpreted as being
zero.
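
A check of this kind can be sketched as follows. The scores are hypothetical, and the critical value is the standard two-tailed t value for six degrees of freedom at the .05 level, taken from a t table; the paper does not say how its own significance tests were computed, so this is one common approach rather than the authors' procedure.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t for testing whether the correlation is zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical scores for eight students on two tests.
listening = [12, 15, 18, 20, 22, 25, 27, 29]
cloze = [8, 11, 10, 15, 14, 19, 18, 22]

r = pearson_r(listening, cloze)
t = t_statistic(r, len(listening))

# Two-tailed critical t at the .05 level with 6 degrees of freedom.
T_CRITICAL_DF6 = 2.447
significant = abs(t) > T_CRITICAL_DF6
```

Because the critical value depends on the degrees of freedom, a coefficient of a given size can be significant in a large group and non-significant in a small one.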

Recall that, in Table 1, there was considerable variation in the magnitude of
the means and standard deviations across languages and versions. Table 4 shows
the results of an analysis of covariance procedure which used language
(Indonesian, Khmer, Tagalog, Thai and Vietnamese) as a categorical variable
and MLAT language aptitude scores as a covariate to determine whether there
were significant differences across languages for the mean test scores (Listening,
Interview, Dictation and Cloze treated as repeated measures).

TABLE 4: ANALYSIS OF COVARIANCE ACROSS REPEATED MEASURES (TESTS)

SOURCE                          SS     DF        MS        F

BETWEEN SUBJECTS
  LANGUAGE                3197.197      4   799.299    7.246*
  MLAT (COVARIATE)         256.014      1   256.014    2.322
  SUBJECTS WITHIN GROUPS  628?.642     57   110.2??

WITHIN SUBJECTS
  LANGUAGE                7156.650     12   596.387   18.196*
  MLAT (COVARIATE)          80.513      3    26.838    0.819
  SUBJECTS WITHIN GROUPS  5604.643    171    32.776

* p < .05
? digit illegible in the source scan



In Table 4, it is important to realize that the asterisks next to the F ratios
indicate that there is some significant difference among the means for different
languages across the four tests. This means in effect that at least one of the
differences in means shown in Table 1 is due to other than chance factors (with
95 percent certainty). Of course, many more of the differences may also be
significant, but there is no way of knowing which they are from this overall
analysis. It should suffice to recognize that a significant difference exists
somewhere across languages. The lack of asterisks after the F ratios for the
MLAT indicates that there was no significant difference detected in language
aptitude (as measured by MLAT) among the groups of students taking the five
languages.
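
The arithmetic behind an F ratio can be checked directly from the sums of squares and degrees of freedom. The sketch below uses the within-subjects figures printed in Table 4; the function name is hypothetical, and the calculation is the standard one (each mean square is a sum of squares divided by its degrees of freedom, and F is the effect mean square divided by the error mean square).

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """Return (MS effect, MS error, F) from sums of squares and degrees of freedom."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect, ms_error, ms_effect / ms_error


# Within-subjects rows of Table 4 (error term: subjects within groups).
SS_ERROR, DF_ERROR = 5604.643, 171
ms_language, ms_error, f_language = f_ratio(7156.650, 12, SS_ERROR, DF_ERROR)
ms_mlat, _, f_mlat = f_ratio(80.513, 3, SS_ERROR, DF_ERROR)
```

Rounded to three decimals, these reproduce the within-subjects F ratios reported in the table for language and for the MLAT covariate.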






Since analysis of covariance is a fairly controversial procedure, two
additional steps were taken:

(1) First, the assumption of homogeneity of slopes was carefully
checked by calculating and examining the interaction terms before
performing the actual analysis of covariance. The interactions were
not found to be significant.

(2) Second, multivariate analyses (including Wilks' lambda, Pillai trace,
and Hotelling-Lawley trace) were also calculated. Since they led to
exactly the same conclusions as the univariate statistics shown in
Table 4, they are not reported here.

Thus the assumptions were found to be met for the univariate analysis of
covariance procedures in a repeated measures design, and the results were
further confirmed using multivariate procedures. It is therefore with a fair
amount of confidence that these results are reported here.



[TABLE 5: MEAN and STD for each TEST by LEVEL of study (first, second and third
year), summarized across languages; values not reliably recoverable from the
source scan]



One other important result was found in this study: the tests do appear to
reflect the differences in ability found between levels of study. This is
an important issue for overall proficiency tests like the SEASSIPE because they
should be sensitive to the types of overall differences in language ability that
would develop over time, or among individuals studying at different levels.
While this differential level effect was found for each of the languages, it is
summarized across languages in Table 5 (in the interests of economy of space).
Notice that, with one exception, the means get higher on all of the tests as the
level of the students goes up from first to second to third year. The one anomaly
is between the first and second years on the oral interview.



DISCUSSION 

The purpose of this section will be to interpret the results reported above 
with the goal of providing direct answers to the original research questions posed 
at the beginning of this study. Consequently, the research questions will be 
restated and used as headings to help organize the discussion. 

(1) How are the scores distributed for each test of the proficiency battery for each
language, and how do the distributions differ between the pilot (1988) and
revised (1989) versions?

The results in Table 1 indicate that most of the current tests are reasonably
well-centered and have scores that are fairly widely dispersed about the central
tendency. Several notable exceptions seem to be the 1989 Oral Interviews for
Indonesian and Khmer, both of which appear to be negatively skewed (providing
classic examples of what is commonly called the ceiling effect - see Brown 1988a
for further explanation). It is difficult, if not impossible, to disentangle whether
the differences found between the two versions of the test (1988 and 1989) are
due to the revision processes in which many of the tests were shortened and
improved, or to differences in the samples used during the two SEASSIs.

(2) To what degree are the tests reliable? How does the reliability differ between
the pilot and revised versions?

Table 2 shows an array of reliability coefficients for the 1988 pilot version 
and 1989 revised tests that are all moderate to very high in magnitude. The
lowest of these is for the 1989 Indonesian Listening test. It is low enough that 
the results for this test should only be used with extreme caution until it can be 
administered again to determine whether the low reliability is a result of bad test 
design or some aspect of the sample of students who took the test. 

These reliability statistics indicate that most of the tests produce
reasonably consistent results even when they are administered to the relatively
homogeneous population of SEASSI students. The revision process appears to
have generally, though not universally, improved test reliability either in terms of
producing higher reliability indices or approximately equal estimates, but for
shorter, more efficient, versions. The listening tests for Indonesian and Tagalog
are worrisome because the reliabilities are lower in the revised than in the pilot
versions and because they are found among the 1989 results. However, it is
important to remember that these are fairly short tests and that they are being
administered to relatively restricted ranges of ability in the various language
groups. These are both important factors because, all things being equal, a
short test will be less reliable than a long test, and a restricted range of talent will
produce lower reliability estimates than a wide one (for further explanation and
[...]).

[...] This is typical. K-R21 is a relatively easy to calculate reliability estimate, but it
usually underestimates the actual reliability of the test (see for instance, the 1989
Revised Khmer and Thai cloze test reliabilities in Table 2).
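
The relationship between test length and reliability can be illustrated with the Spearman-Brown prophecy formula. That formula is not used in this paper; it is a standard result brought in here purely for illustration, and the starting reliability of .80 is hypothetical. The sketch also assumes that the items removed are comparable to those that remain, which a targeted revision deliberately violates by dropping the weakest items.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical: a 36-item test with reliability .80 is cut to 30 comparable items.
predicted = spearman_brown(0.80, 30 / 36)
```

Under these assumptions the predicted reliability drops to about .77, so a revised 30-item test that matches or exceeds its 36-item pilot in reliability has gained in efficiency.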

(3) To what degree are the tests intercorrelated? How do these correlation
coefficients differ between the pilot and revised versions?

In most cases the correlation coefficients reported in Table 3 indicate a
surprisingly high degree of relationship among the tests. The one systematic
exception is the set of coefficients found for Thai. It is important to note
that these correlation coefficients for Thai are based on very small samples (due
mostly to the fact that students at the lowest level were not taught to write in
Thai) and that these correlation coefficients were not statistically significant at
the p < .05 level. They must therefore be interpreted as correlation coefficients
that probably occurred by chance alone, or simply as correlations of zero.

(4) To what degree are the tests parallel across languages?

[...] One possible cause for these differences is that the tests have changed
[...] started out as [...]. [...] potential cause of the statistically significant
differences reported in Tables 1, 4 and 5 is that there may have been
considerable variations in the samples used during the two summers.

(5) To what degree are the tests valid for purposes of testing overall proficiency in
these languages?

The intercorrelations among the tests for each language (see Table 3)
indicate that moderate to strong systematic relationships exist among many of
the tests in four of the five languages being tested in this project (the exception is
Thai). However, this type of correlational analysis is far from sufficient for
analysing the validity of these tests. If there were other well-established tests of
the skills being tested in these languages, it would be possible to administer
those criterion tests along with the SEASSIPE tests and study the correlation
coefficients between our relatively new tests and the well-established measures.
Such information could then be used to build arguments for the
criterion-related validity of some or all of these measures. Unfortunately, no such
well-established criterion measures were available at the time of this project.

However, there are results in this study that do lend support to the
construct validity of these tests. The fact that the tests generally reflect
differences between levels of study (as shown in Table 5) provides evidence for
the construct validity (the differential groups type) of these tests.

Nevertheless, much more evidence should be gathered on the validity of the 
various measures in this study. An intervention study of their construct validity 
could be set up by administering the tests before and after instruction to 
determine the degree to which they are sensitive to the language proficiency 
construct which is presumably being taught in the course. If, in future data, 
correlational analyses indicate patterns similar to those found here, factor
analysis might also be used profitably to explore the variance
structures of those relationships.

The point is that there are indications in this study of the validity of the tests 
involved. However, in the study of validity, it is important to build arguments from 
a number of perspectives on an ongoing basis. Hence, in a sense, the study of 
validity is never fully complete as long as more evidence can be gathered and 
stronger arguments can be constructed. 

(6) To what degree are the strategies described here generalizable to test
development projects for other languages?

From the outset, this project was designed to provide four different types of
proficiency tests - tests that would be comparable across five languages. The
intention was to develop tests that would produce scores that were comparable
across languages such that a score of 34 would be roughly comparable in
Indonesian, Khmer, Tagalog, Thai and Vietnamese. Perhaps this entire aspect
of the project was quixotic from the very beginning. Recall that the process
began with the creation of English language prototypes for the listening test, oral
interview, dictation and cloze procedure. These prototypes were then translated
into the five languages with strict instructions to really translate them, i.e., to
make them comfortably and wholly Indonesian, Khmer, Tagalog, Thai and
Vietnamese. While the very act of translating the passages in five different
directions probably affected their comparability across languages, they probably
remained at least roughly the same at this stage of development. Then, during
the summer of 1988, the tests were administered, analyzed and revised
separately using different samples of students with the result that the tests
further diverged in content and function.

We now know that the use of English language prototypes for the 
development of these tests may have created problems that we did not foresee. 
One danger is that such a strategy avoids the use of language that is authentic in 
the target language. For instance, a passage that is translated from English for 
use in a Khmer cloze test may be on a topic that would never be discussed in the 
target culture, may be organized in a manner totally alien to Khmer, or may simply 
seem stilted to native speakers of Khmer because of its rhetorical structure. 
These problems could occur no matter how well-translated the passage might be. 

Ultimately, the tests did not turn out to be similar enough across languages 
to justify using this translation strategy. Thus we do not recommend its use in 
further test development projects. It would probably have been far more 
profitable to use authentic materials from the countries involved to develop tests 
directly related to the target languages and cultures. 



CONCLUSION 



In summary, the tests in each of the five SEASSI Proficiency Examinations 
appear to be reasonably well-centered and seem to adequately disperse the 
students' performance. They are also reasonably reliable. Naturally, future 
research should focus on ways to make the tests increasingly reliable and further 
build a case for their validity. Thus the final versions of the tests can be passed 
on to future SEASSIs at other sites with some confidence that any decisions 
based on them will be reasonably professional and sound. It is also with some 
confidence that the tests will be used here at the University of Hawaii at Manoa 
to test the overall proficiency of students studying Indonesian, Khmer, Tagalog, 
Thai, and Vietnamese. However, the process of test development and revision 
should never be viewed as finished. Any test can be further improved and made 
to better serve the population of students and teachers who are the ultimate 
users of such materials. 

One final point must be stressed: we could never have successfully carried 
out this project without the cooperation of the many language teachers who 
volunteered their time while carrying out other duties in the Indo-Pacific languages 
department, or the SEASSIs held at University of Hawaii at Manoa. We owe each 
of these language teachers a personal debt of gratitude. Unfortunately, we can only 
thank them as a group for their professionalism and hard work. 



REFERENCES 



ACTFL. (1986). ACTFL proficiency guidelines. Hastings-on-Hudson, NY: 
American Council on the Teaching of Foreign Languages. 

Anderson-Bell. (1989). ABSTAT. Parker, CO: Anderson-Bell. 

APA. (1985). Standards for educational and psychological testing. Washington, 
DC: American Psychological Association. 

Bachman, L and S Savignon. (1986). The evaluation of communicative language 
proficiency: a critique of the ACTFL oral interview. Modern Language Journal 
70, 380-390. 

Borland. (1989). Quattro Pro. Scotts Valley, CA: Borland International. 

Brown, J D. (1983). A closer look at cloze: validity and reliability. In J W Oller, 
Jr (Ed). Issues in language testing. Cambridge, MA: Newbury House. 

Brown, J D. (1984). A cloze is a cloze is a cloze? In J Handscombe, R A Orem, 
and B P Taylor (Eds). On TESOL '83: the question of control. Washington, 
DC: TESOL. 

Brown, J D. (1988a). Understanding research in second language learning: A 
teacher's guide to statistics and research design. London: Cambridge University. 

Brown, J D. (1988b). Tailored cloze: improved with classical item analysis 
techniques. Language Testing 5, 19-31. 

Brown, J D, G Cook, C Lockhart and T Ramos. (Unpublished ms). The 
SEASSI Proficiency Examination Technical Manual. Honolulu, HI: University 
of Hawaii at Manoa. 

Carroll, J B and S M Sapon. (1959). Modern language aptitude test. New York: 
The Psychological Corporation. 

Cronbach, L J. (1970). Essentials of psychological testing (3rd ed). New York: 
Harper and Row. 

Ebel, R L. (1979). Essentials of educational measurement (3rd ed). Englewood 
Cliffs, NJ: Prentice-Hall. 

Fry, E B. (1976). Fry readability scale (extended). Providence, RI: Jamestown 
Publishers. 

ILR. (1982). Interagency Language Roundtable Language Skill Level Descriptions: 
Speaking. Appendix B in Liskin-Gasparro (1982). 

Kuder, G F and M W Richardson. (1937). The theory of estimation of test 
reliability. Psychometrika, 2, 151-160. 

Liskin-Gasparro. (1982). Testing and teaching for oral proficiency. Boston: 
Heinle and Heinle. 

Savignon, S J. (1985). Evaluation of communicative competence: the ACTFL 
provisional proficiency guidelines. Modern Language Journal, 69, 129-142. 

Wilkinson, L. (1988). SYSTAT: The system for statistics. Evanston, IL: 
SYSTAT. 






ACTFL PROFICIENCY GUIDELINES FOR SPEAKING AND LISTENING 

(ACTFL 1986) 



Generic Descriptions-Speaking 

Novice  The Novice level is characterized by the ability to communicate minimally with learned 
material. 

Novice-Low  Oral production consists of isolated words and perhaps a few high-frequency phrases. 
Essentially no functional communicative ability. 

Novice-Mid  Oral production continues to consist of isolated words and learned phrases within very 
predictable areas of need, although quantity is increased. Vocabulary is sufficient only for handling 
simple, elementary needs and expressing basic courtesies. Utterances rarely consist of more than two 
or three words and show frequent long pauses and repetition of interlocutor's words. Speaker may 
have some difficulty producing even the simplest utterances. Some Novice-Mid speakers will be 
understood only with great difficulty. 

Novice-High  Able to satisfy partially the requirements of basic communicative exchanges by relying 
heavily on learned utterances but occasionally expanding these through simple recombinations of 
their elements. Can ask questions or make statements involving learned material. Shows signs of 
spontaneity although this falls short of real autonomy of expression. Speech continues to consist of 
learned utterances rather than of personalized, situationally adapted ones. Vocabulary centers on 
areas such as basic objects, places, and most common kinship terms. Pronunciation may still be 
strongly influenced by first language. Errors are frequent and, in spite of repetition, some 
Novice-High speakers will have difficulty being understood even by sympathetic interlocutors. 

Intermediate  The Intermediate level is characterized by the speaker's ability to: 

- create with the language by combining and recombining learned elements, though primarily in a 
reactive mode; 
- initiate, minimally sustain, and close in a simple way basic communicative tasks; and 
- ask and answer questions. 

Intermediate-Low  Able to handle successfully a limited number of interactive, task-oriented and 
social situations. Can ask and answer questions, initiate and respond to simple statements and 
maintain face-to-face conversation, although in a highly restricted manner and with much linguistic 
inaccuracy. Within these limitations, can perform such tasks as introducing self, ordering a meal, 
asking directions, and making purchases. Vocabulary is adequate to express only the most elementary 
needs. Strong interference from native language may occur. Misunderstandings frequently arise, but 
with repetition, the Intermediate-Low speaker can generally be understood by sympathetic 
interlocutors. 

Intermediate-Mid  Able to handle successfully a variety of uncomplicated, basic and communicative 
tasks and social situations. Can talk simply about self and family members. Can ask and answer 
questions and participate in simple conversations on topics beyond the most immediate needs; e.g., 
personal history and leisure time activities. Utterance length increases slightly, but speech may 
continue to be characterized by frequent long pauses, since the smooth incorporation of even basic 
conversational strategies is often hindered as the speaker struggles to create appropriate language 
forms. Pronunciation may continue to be strongly influenced by first language and fluency may still 
be strained. Although misunderstandings still arise, the Intermediate-Mid speaker can generally be 
understood by sympathetic interlocutors. 

Intermediate-High  Able to handle successfully most uncomplicated communicative tasks and social 
situations. Can initiate, sustain, and close a general conversation with a number of strategies 
appropriate to a range of circumstances and topics, but errors are evident. Limited vocabulary still 
necessitates hesitation and may bring about slightly unexpected circumlocution. There is emerging 
evidence of connected discourse, particularly for simple narration and/or description. The 
Intermediate-High speaker can generally be understood even by interlocutors not accustomed to 
dealing with speakers at this level, but repetition may still be required. 




BEST COPY AVAILABLE 





[...] level speaker can [...] 

Advanced-Plus  Able to satisfy the requirements of a broad variety of everyday, school, and work 
situations. Can discuss concrete topics relating to particular interests and special fields of 
competence. There is emerging evidence of ability to support opinions, explain in detail, and 
hypothesize. The Advanced-Plus speaker often shows a well developed ability to compensate for an 
imperfect grasp of some forms with confident use of communicative strategies, such as paraphrasing 
and circumlocution. Differentiated vocabulary and intonation are effectively used to communicate 
fine shades of meaning. The Advanced-Plus speaker often shows remarkable fluency and ease of speech 
but under the demands of Superior-level, complex tasks, language may break down or prove 
inadequate. 

Superior  [...] some complex high-frequency structures more common to [...] evident. Errors do not 
disturb the native speaker or interfere with communication. 



Generic Descriptions-Listening 

These guidelines assume that all listening tasks take place in an authentic environment at a normal 
rate of speech using standard or near-standard norms. 

[...] repetition and/or a slower rate of speech. 






Novice-High  Able to understand short, learned utterances and some sentence-length utterances, 
particularly where context strongly supports understanding and speech is clearly audible. 
Comprehends words and phrases from simple questions, statements, high-frequency commands and 
courtesy formulae. May require repetition, rephrasing and/or a slowed rate of speech for 
comprehension. 

Intermediate-Low  Able to understand sentence-length utterances which consist of recombinations of 
learned elements in a limited number of content areas, particularly if strongly supported by the 
situational context. Content refers to basic personal background and needs, social conventions and 
routine tasks, such as getting meals and receiving simple instructions and directions. Listening 
tasks pertain primarily to spontaneous face-to-face conversations. Understanding is often uneven; 
repetition and rewording may be necessary. Misunderstandings in both main ideas and details arise 
frequently. 

Intermediate-Mid  Able to understand sentence-length utterances which consist of recombinations of 
learned utterances on a variety of topics. Content continues to refer primarily to basic personal 
background and needs, social conventions and somewhat more complex tasks, such as lodging, 
transportation, and shopping. Additional content areas include some personal interests and 
activities, and a greater diversity of instructions and directions. Listening tasks not only 
pertain to spontaneous face-to-face conversations but also to short routine telephone 
conversations and some deliberate speech, such as simple announcements and reports over the media. 
Understanding continues to be uneven. 

Intermediate-High  Able to sustain understanding over longer stretches of connected discourse on a 
number of topics pertaining to different times and places; however, understanding is inconsistent 
due to failure to grasp main ideas and/or details. Thus, while topics do not differ significantly 
from those of an Advanced level listener, comprehension is less in quantity and poorer in quality. 

Advanced  Able to understand main ideas and most details of connected discourse on a variety of 
topics beyond the immediacy of the situation. Comprehension may be uneven due to a variety of 
linguistic and extralinguistic factors, among which topic familiarity is very prominent. These 
texts frequently involve description and narration in different time frames or aspects, such as 
present, nonpast, habitual, or imperfective. Texts may include interviews, short lectures on 
familiar topics, and news items and reports primarily dealing with factual information. Listener 
is aware of cohesive devices but may not be able to use them to follow the sequence of thought in 
an oral text. 

Advanced-Plus  Able to understand the main ideas of most speech in a standard dialect; however, the 
listener may not be able to sustain comprehension in extended discourse which is propositionally 
and linguistically complex. Listener shows an emerging awareness of culturally implied meanings 
beyond the surface meanings of the text but may fail to grasp sociocultural nuances of the message. 

Superior  Able to understand the main ideas of all speech in a standard dialect, including 
technical discussion in a field of specialization. Can follow the essentials of extended discourse 
which is propositionally and linguistically complex, as in academic/professional settings, in 
lectures, speeches, and reports. Listener shows some appreciation of aesthetic norms of target 
language, of idioms, colloquialisms, and register shifting. Able to make inferences within the 
cultural framework of the target language. Understanding is aided by an awareness of the 
underlying organizational structure of the oral text and includes sensitivity for its social and 
cultural references and its affective overtones. Rarely misunderstands but may not understand 
excessively rapid, highly colloquial speech or speech that has strong cultural references. 

Distinguished  Able to understand all forms and styles of speech pertinent to personal, social and 
professional needs tailored to different audiences. Shows strong sensitivity to social and cultural 
references and aesthetic norms by processing language from within the cultural framework. Texts 
include theater plays, screen productions, editorials, symposia, academic debates, public policy 
statements, literary readings, and most jokes and puns. May have difficulty with some dialects and 
slang. 






